
IPIP-499: UnixFS CID Profiles#499

Open
mishmosh wants to merge 41 commits into ipfs:main from mishmosh:patch-1

@mishmosh
Contributor

@mishmosh mishmosh commented Apr 3, 2025

Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID.

This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. They can be used to verify data across implementations, provide recommended settings depending on retrieval performance goals, and more.
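
As a rough illustration of what "explicitly define" could mean in practice, here is a minimal Go sketch of a profile as a plain struct, plus a helper that lists where two tools disagree. The `Profile` type, its field names, and `Diff` are hypothetical and not part of Kubo, boxo, or Helia. The example values for `unixfs-v1-2025` are the ones quoted elsewhere in this thread (1 MiB chunks, balanced layout, width 1024, HAMT fanout 256, 256 KiB block-bytes threshold) and are not a normative copy of the spec. CID version and hash function are left out because this thread does not state them.

```go
package main

import "fmt"

// Profile is a hypothetical container for parameters a CID profile pins down.
// Field names are illustrative and do not come from Kubo, boxo, or Helia.
type Profile struct {
	Name               string
	ChunkSize          int    // bytes per leaf chunk
	DAGLayout          string // "balanced", "balanced-packed", "trickle", ...
	DAGWidth           int    // max children per file DAG node
	HAMTFanout         int    // max children per HAMT shard
	HAMTThreshold      int    // directory size in bytes above which sharding starts
	HAMTSizeEstimation string // "links-bytes" or "block-bytes"
}

// Diff returns the names of the parameters on which two profiles disagree.
// A non-empty result means the same input can yield different CIDs.
func Diff(a, b Profile) []string {
	var out []string
	if a.ChunkSize != b.ChunkSize {
		out = append(out, "ChunkSize")
	}
	if a.DAGLayout != b.DAGLayout {
		out = append(out, "DAGLayout")
	}
	if a.DAGWidth != b.DAGWidth {
		out = append(out, "DAGWidth")
	}
	if a.HAMTFanout != b.HAMTFanout {
		out = append(out, "HAMTFanout")
	}
	if a.HAMTThreshold != b.HAMTThreshold {
		out = append(out, "HAMTThreshold")
	}
	if a.HAMTSizeEstimation != b.HAMTSizeEstimation {
		out = append(out, "HAMTSizeEstimation")
	}
	return out
}

func main() {
	// Values as quoted in this thread; treat them as an assumption.
	modern := Profile{
		Name:               "unixfs-v1-2025",
		ChunkSize:          1 << 20,
		DAGLayout:          "balanced",
		DAGWidth:           1024,
		HAMTFanout:         256,
		HAMTThreshold:      256 << 10,
		HAMTSizeEstimation: "block-bytes",
	}
	// A made-up tool that matches everything except the layout variant.
	other := modern
	other.Name = "hypothetical-tool"
	other.DAGLayout = "balanced-packed"

	fmt.Println(Diff(modern, other)) // prints: [DAGLayout]
}
```

A single differing field is enough to change the root CID for at least some inputs, which is why the proposal treats the whole parameter set as one named unit.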

@mishmosh mishmosh requested a review from a team as a code owner April 3, 2025 14:03
@mishmosh mishmosh changed the title Create ipip-0000.md: CID profiles IPIP 0499: CID Profiles Apr 3, 2025
lidel added a commit to ipfs/kubo that referenced this pull request Apr 15, 2025
lets make the fanout match the max links from files
and rename profile to `-wide`

this will make it easier to discuss in ipfs/specs#499
lidel and others added 2 commits April 15, 2025 23:41
Co-authored-by: Bumblefudge <bumblefudge@learningproof.xyz>
Import.* config params for controlling DAG width were added in:
ipfs/kubo#10774
@lidel
Member

lidel commented Apr 15, 2025

Thank you for kicking this off and filling in the initial state.

I've incorporated specific "dag width" settings for File, Directory and HAMTDirectory nodes,
and updated the table to reflect the state from ipfs/kubo#10774
and the profiles that exist in the Kubo master branch: legacy-cid-v0, test-cid-v1 and test-cid-v1-wide.

Next:

  • agree on what the "cid-2025" profile should look like
    • this will be the new default in "Kubo v1.0"
    • we have test-cid-v1 and test-cid-v1-wide in Kubo as potential candidates
  • switch to a PR from a local branch (so we have a build preview)
  • figure out how to render the information (currently the table is not supported by https://github.com/ipfs/spec-generator)

@SethDocherty

This comment was marked as off-topic.

@github-actions

github-actions bot commented Jan 16, 2026

🚀 Build Preview on IPFS ready

include singularity as example showing balanced layout has implementation
variants that affect CID determinism for large files:
- document balanced-packed DAG layout variant
  (data-preservation-programs/singularity#525)
- note boxo defaults for HAMT parameters
- note rclone defaults for hidden files and symlinks

This structural difference causes CID mismatches for files larger than `chunk_size * dag_width` (e.g., >1 GiB with 1 MiB chunks and 1024 links per node), even when all other parameters match.
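
To make the `chunk_size * dag_width` boundary concrete, here is a minimal Go sketch that computes how many levels of intermediate nodes a file needs when every node links at most `dagWidth` children. The function name `balancedDepth` is hypothetical, and the model is simplified: it only counts levels and does not describe how any real importer arranges nodes within a level. It assumes `dagWidth >= 2`.

```go
package main

import "fmt"

// balancedDepth returns the number of node levels above the leaf chunks for a
// file of fileSize bytes, given fixed-size chunks and at most dagWidth links
// per node. Depth 0 means the file is a single chunk. Assumes dagWidth >= 2.
func balancedDepth(fileSize, chunkSize, dagWidth int64) int {
	leaves := (fileSize + chunkSize - 1) / chunkSize // ceil(fileSize / chunkSize)
	depth := 0
	for capacity := int64(1); capacity < leaves; capacity *= dagWidth {
		depth++
	}
	return depth
}

func main() {
	const chunk = 1 << 20 // 1 MiB
	const width = 1024

	fmt.Println(balancedDepth(1<<20, chunk, width))     // prints: 0
	fmt.Println(balancedDepth(1<<30, chunk, width))     // prints: 1
	fmt.Println(balancedDepth((1<<30)+1, chunk, width)) // prints: 2
}
```

With these numbers, a file of exactly 1 GiB still fits under a single root, while one more byte forces a second level. That second level is where layout variants have room to arrange nodes differently.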

### Divergence in current implementations
Member

I've included Singularity as an example showing that even among different "balanced" layout implementations we can see different variants that affect CID determinism for large files.

  • documented balanced and balanced-packed DAG layout variants
  • noted implicit boxo defaults for HAMT parameters that the project seems to be using
  • assumed it uses rclone's defaults for hidden files and symlinks

cc data-preservation-programs/singularity#525 @SethDocherty @parkan @2color to proofread if the "singularity" column here reflects reality or if there is more nuance to what Singularity does

Contributor Author

@mishmosh mishmosh Jan 16, 2026

Nice, thanks.

@lidel,

Nice addition with the balanced-packed section. What you've documented there sounds correct to me. I also believe one of the design reasons Singularity went with a balanced-packed approach is to pack chunked content as tightly as possible to fit into Filecoin sectors. Up to you, but that's some additional design context that may be worth noting.

The table details for Singularity look correct, too.

I'll defer to @parkan, @2color for confirmation.

lidel added a commit to ipfs/go-ipfs-cmds that referenced this pull request Jan 17, 2026
add new --dereference-symlinks boolean flag that recursively resolves
all symlinks to their target content during file collection. this works
on symlinks inside directories, not just CLI arguments.

the flag is wired through cli/parse.go to boxo's SerialFileOptions.DereferenceSymlinks.

deprecate --dereference-args which only worked on symlinks passed directly
as CLI arguments. the help text now indicates it is deprecated and directs
users to use --dereference-symlinks instead.

ref: ipfs/specs#499
lidel added a commit to ipfs/kubo that referenced this pull request Jan 17, 2026
add CLI flags for controlling file collection behavior during ipfs add:

- `--dereference-symlinks`: recursively resolve symlinks to their target
  content (replaces deprecated --dereference-args which only worked on
  CLI arguments). wired through go-ipfs-cmds to boxo's SerialFileOptions.
- `--empty-dirs` / `-E`: include empty directories (default: true)
- `--hidden` / `-H`: include hidden files (default: false)

these flags are CLI-only and not wired to Import.* config options because
go-ipfs-cmds library handles input file filtering before the directory
tree is passed to kubo. removed unused Import.UnixFSSymlinkMode config
option that was defined but never actually read by the CLI.

also:
- wire --trickle to Import.UnixFSDAGLayout config default
- update go-ipfs-cmds to v0.15.1-0.20260117043932-17687e216294
- add SYMLINK HANDLING section to ipfs add help text
- add CLI tests for all three flags

ref: ipfs/specs#499
| DAG layout | balanced |
| DAG width (children per node) | 1024 |
| HAMTDirectory fanout | 256 blocks |
| HAMTDirectory threshold | 256KiB (block-bytes) |
Member

I don't think Helia can currently do block-bytes. How much of a problem is this for making this the "modern" profile? Or does that just mean there's work to do?

https://github.com/ipfs/helia/blob/005c2a7a5e45349398cf750fd73f3c47591bb00a/packages/unixfs/src/commands/utils/is-over-shard-threshold.ts#L34-L45

Member

Oh, I see from your table that links-bytes is the most common across the implementations, so why not just use it? It's not precise, but it means that you don't need to change implementations as much to conform.

Member

@rvagg it just means there's work to do to implement block-bytes, but we will do it in Helia for parity with Kubo.

I hear you - standardizing on links-bytes sounds pragmatic at first. But the more I dug into this, the more I realized implementations aren't actually aligned on anything that feels solid:

  • Go used >= despite comments claiming > (just fixed in ipfs/boxo@6707376); JS uses > (imo correctly)
  • but then JS hardcodes CID byte lengths assuming sha2-256. Using a longer hash function would produce an incorrect estimate. Yes, this doesn't impact the default today, but it's not a solid foundation for a modern standard: it would backfire in the future when we switch away from sha2-256

Both implementations need code changes to align on any method. There's no free lunch here.

Given that we're already introducing unixfs-v1-2025 with intentional breaking changes (1MiB chunks), this is our window to get it right and avoid ambiguity.

The goal of IPIP-499 is to create a sensible modern profile that is implementable without poking your eyes out. I feel it needs block-bytes: it is unambiguous, precise, hash-agnostic, and future-proof. Legacy behavior stays in unixfs-v0-2015 for anyone who needs it, and modern software can just choose to not implement it.

(This is my "strong belief weakly held", I'm happy to be convinced otherwise :))
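
To show why the comparison operator matters at the exact boundary, here is a minimal Go sketch of links-based size estimation. It uses the per-link formula `name_len + cid_len` that the Kubo test description in this thread gives for the v0-2015 mode. The `link` type and both function names are hypothetical and are not boxo or Helia code. Block-bytes estimation, which measures the serialized block, is not modeled here. The 36-byte CID length assumes a binary CIDv1 with a sha2-256 multihash.

```go
package main

import "fmt"

// link is a hypothetical stand-in for one directory entry.
type link struct {
	Name   string
	CIDLen int // length of the binary CID in bytes
}

// linksBytes estimates directory size as sum(name_len + cid_len).
func linksBytes(links []link) int {
	total := 0
	for _, l := range links {
		total += len(l.Name) + l.CIDLen
	}
	return total
}

// shouldShard uses a strict comparison: a directory whose estimate equals the
// threshold stays a basic directory.
func shouldShard(estimated, threshold int) bool {
	return estimated > threshold
}

func main() {
	const threshold = 256 << 10 // 262144 bytes

	// 4096 entries of 28-byte names plus 36-byte CIDs is exactly 262144.
	links := make([]link, 0, 4097)
	for i := 0; i < 4096; i++ {
		links = append(links, link{Name: fmt.Sprintf("%028d", i), CIDLen: 36})
	}
	est := linksBytes(links)
	fmt.Println(est, shouldShard(est, threshold)) // prints: 262144 false
	fmt.Println(est >= threshold)                 // prints: true (the older >= behavior would shard here)

	// One more entry pushes the estimate over the threshold.
	links = append(links, link{Name: "x", CIDLen: 36})
	fmt.Println(shouldShard(linksBytes(links), threshold)) // prints: true
}
```

A directory that lands exactly on the threshold gets a different root CID depending on which operator an implementation uses, so both the estimation method and the operator have to be pinned down by the profile.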

Member

reasonable, as long as there is people-budget to do all the work on this 👍

Member

FYSA JS work tracked in ipfs/helia#941

- add verification process for testing profile compliance
- add test vectors for unixfs-v0-2015 profile (5 CIDs)
- add test vectors for unixfs-v1-2025 profile (5 CIDs)
- document HAMT threshold behavior (> not >=)
- document balanced vs trickle DAG layout with ASCII diagrams
- add trickle layout CID for reader compatibility testing
- reference existing test vectors from UnixFS spec (empty dir, symlink)
@lidel
Member

lidel commented Jan 24, 2026

Update (2026-01-24)

Since the last update, we've made progress on multiple fronts:

specs (this PR)

  • Pinged Pinata and Filebase for details on how they generate DAGs
  • Added Singularity to the divergence table and documented the balanced-packed layout variant that causes CID mismatches for large files
  • Added test fixtures section with deterministic CIDs for both profiles, covering:
    • small files, files at/over max links threshold
    • directories at/over HAMT threshold
    • DAG layout verification (balanced vs trickle)

boxo

kubo

helia

Next steps

We're planning to wait for kubo 0.40 to ship with the new profile built-in, and have both profiles explicitly implemented and tested in Helia as well, before ratifying this IPIP.

Last call for feedback on unixfs-v1-2025

If you have concerns about the unixfs-v1-2025 profile parameters, now (before January ends) is the time to speak up. Once this ships in Kubo 0.40, we can always create a unixfs-v1-2026 profile with different choices, but we won't be changing the 2025 one.

@Chara-Freedom

Chara-Freedom commented Jan 25, 2026

If you have concerns about the unixfs-v1-2025 profile parameters, now (before January ends) is the time to speak up

I can tell you, as a small but relatively popular developer who stores their program on IPFS and uses their own IPNS republisher/update system, that this is very good! The most important thing is that all Kubo 0.40 nodes use unixfs-v1-2025 by default, and that everyone complies with it. Determinism is very good, but determinism across different implementations is even better. It was very frustrating to see for the first time that Storacha and Pinata used different CID generation mechanisms. Thank you very much for all the hard work, God bless you!

lidel added 2 commits January 28, 2026 00:58
add chunk size threshold test vectors for precise boundary testing:
- file-at-chunk: file exactly at chunk size (single block)
- file-over-chunk: file +1 byte over chunk size (2 blocks)

update file-over-max-links CIDs to use +1 byte threshold instead of
+1 chunk, enabling more precise DAG rebalancing boundary tests.
- move HAMT sharding history to Motivation, add HAMT switch comparison to profiles
- rename Divergence section to Divergences across ecosystem
- merge Observed differences into balanced-packed, promote subsections to h3
- move Why sections under Compatibility
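
The boundary sizes named in these commit messages follow directly from a profile's chunk size and max links. Below is a minimal Go sketch that derives them and counts the resulting leaf chunks. The names `fixtureSizes`, `boundarySizes`, and `leafChunks` are hypothetical, and the values 1 MiB and 1024 are the ones quoted for unixfs-v1-2025 earlier in this thread. The sketch says nothing about the byte content of the real fixtures.

```go
package main

import "fmt"

// fixtureSizes holds the four file sizes (in bytes) that sit exactly at, or one
// byte over, the two boundaries discussed above.
type fixtureSizes struct {
	AtChunk, OverChunk, AtMaxLinks, OverMaxLinks int64
}

// boundarySizes derives the boundary file sizes from chunk size and max links.
func boundarySizes(chunkSize, maxLinks int64) fixtureSizes {
	return fixtureSizes{
		AtChunk:      chunkSize,
		OverChunk:    chunkSize + 1,
		AtMaxLinks:   chunkSize * maxLinks,
		OverMaxLinks: chunkSize*maxLinks + 1,
	}
}

// leafChunks returns how many fixed-size chunks a file of the given size yields.
func leafChunks(size, chunkSize int64) int64 {
	return (size + chunkSize - 1) / chunkSize
}

func main() {
	const chunk, maxLinks = 1 << 20, 1024
	s := boundarySizes(chunk, maxLinks)

	fmt.Println(leafChunks(s.AtChunk, chunk))      // prints: 1
	fmt.Println(leafChunks(s.OverChunk, chunk))    // prints: 2
	fmt.Println(leafChunks(s.AtMaxLinks, chunk))   // prints: 1024
	fmt.Println(leafChunks(s.OverMaxLinks, chunk)) // prints: 1025
}
```

Using +1 byte rather than +1 chunk puts each fixture on the smallest input that crosses the boundary, so an off-by-one in either comparison changes the result.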

- 2026-01: [boxo#1088](https://github.com/ipfs/boxo/pull/1088) fixed threshold comparison from `>=` to `>`, aligning with JS implementation and documentation. Shipped in Kubo 0.40.

Timeline of JavaScript implementation changes:
Member

@lidel lidel Jan 28, 2026

@achingbrain added timelines for Go and JS to show all permutations. This is my understanding of the JS timeline (only included things that actually shipped).
TLDR: js-ipfs did link counting and then switched to link-bytes to be better aligned with Go. Is this accurate?

👍 / 👎 if I got this right.

@parkan

parkan commented Feb 4, 2026

@lidel sounds like we're going to need to do some work to make singularity compliant with this (ref data-preservation-programs/singularity#525) yes?

lidel added a commit to ipfs/go-ipfs-cmds that referenced this pull request Feb 4, 2026
#315)

* feat: add --dereference-symlinks flag for recursive symlink resolution

add new --dereference-symlinks boolean flag that recursively resolves
all symlinks to their target content during file collection. this works
on symlinks inside directories, not just CLI arguments.

the flag is wired through cli/parse.go to boxo's SerialFileOptions.DereferenceSymlinks.

deprecate --dereference-args which only worked on symlinks passed directly
as CLI arguments. the help text now indicates it is deprecated and directs
users to use --dereference-symlinks instead.

ref: ipfs/specs#499

* fix: make --dereference-symlinks resolve CLI arg symlinks too

--dereference-symlinks is now a superset of --dereference-args:
- resolves symlinks passed as CLI arguments (like --dereference-args)
- ALSO resolves symlinks found during directory traversal (new behavior)

this allows users to use just --dereference-symlinks instead of needing
to pass both flags for full symlink resolution.

* chore: update to rebased boxo PR

updates github.com/ipfs/boxo to 56cf0aecdc1a (feat/ipip-499-unixfs-2025 rebased on main)

* fix: reuse derefSymlinks variable, fix typo in deprecation notice

* chore: update boxo to f188f79fd412

switches to boxo@main after merging ipfs/boxo#1088
lidel added a commit to ipfs/kubo that referenced this pull request Feb 4, 2026
* feat(config): Import.* and unixfs-v1-2025 profile

implements IPIP-499: add config options for controlling UnixFS DAG
determinism and introduces `unixfs-v1-2025` and `unixfs-v0-2015`
profiles for cross-implementation CID reproducibility.

changes:
- add Import.* fields: HAMTDirectorySizeEstimation, SymlinkMode,
  DAGLayout, IncludeEmptyDirectories, IncludeHidden
- add validation for all Import.* config values
- add unixfs-v1-2025 profile (recommended for new data)
- add unixfs-v0-2015 profile (alias: legacy-cid-v0)
- remove deprecated test-cid-v1 and test-cid-v1-wide profiles
- wire Import.HAMTSizeEstimationMode() to boxo globals
- update go.mod to use boxo with SizeEstimationMode support

ref: https://specs.ipfs.tech/ipips/ipip-0499/

* feat(add): add --dereference-symlinks, --empty-dirs, --hidden CLI flags

add CLI flags for controlling file collection behavior during ipfs add:

- `--dereference-symlinks`: recursively resolve symlinks to their target
  content (replaces deprecated --dereference-args which only worked on
  CLI arguments). wired through go-ipfs-cmds to boxo's SerialFileOptions.
- `--empty-dirs` / `-E`: include empty directories (default: true)
- `--hidden` / `-H`: include hidden files (default: false)

these flags are CLI-only and not wired to Import.* config options because
go-ipfs-cmds library handles input file filtering before the directory
tree is passed to kubo. removed unused Import.UnixFSSymlinkMode config
option that was defined but never actually read by the CLI.

also:
- wire --trickle to Import.UnixFSDAGLayout config default
- update go-ipfs-cmds to v0.15.1-0.20260117043932-17687e216294
- add SYMLINK HANDLING section to ipfs add help text
- add CLI tests for all three flags

ref: ipfs/specs#499

* test(add): add CID profile tests and wire SizeEstimationMode

add comprehensive test suite for UnixFS CID determinism per IPIP-499:
- verify exact HAMT threshold boundary for both estimation modes:
  - v0-2015 (links): sum(name_len + cid_len) == 262144
  - v1-2025 (block): serialized block size == 262144
- verify HAMT triggers at threshold + 1 byte for both profiles
- add all deterministic CIDs for cross-implementation testing

also wires SizeEstimationMode through CLI/API, allowing
Import.UnixFSHAMTSizeEstimation config to take effect.

bumps boxo to ipfs/boxo@6707376 which aligns HAMT threshold with
JS implementation (uses > instead of >=), fixing CID determinism
at the exact 256 KiB boundary.

* feat(add): --dereference-symlinks now resolves all symlinks

Previously, resolving symlinks required two flags:
- --dereference-args: resolved symlinks passed as CLI arguments
- --dereference-symlinks: resolved symlinks inside directories

Now --dereference-symlinks handles both cases. Users only need one flag
to fully dereference symlinks when adding files to IPFS.

The deprecated --dereference-args still works for backwards compatibility
but is no longer necessary.

* chore: update boxo and improve changelog

- update boxo to ebdaf07c (nil filter fix, thread-safety docs)
- simplify changelog for IPIP-499 section
- shorten test names, move context to comments

* chore: update boxo to 5cf22196

* chore: apply suggestions from code review

Co-authored-by: Andrew Gillis <11790789+gammazero@users.noreply.github.com>

* test(add): verify balanced DAG layout produces uniform leaf depth

add test that confirms kubo uses balanced layout (all leaves at same
depth) rather than balanced-packed (varying depths). creates 45MiB file
to trigger multi-level DAG and walks it to verify leaf depth uniformity.

includes trickle subtest to validate test logic can detect varying depths.

supports CAR export via DAG_LAYOUT_CAR_OUTPUT env var for test vectors.

* chore(deps): update boxo to 6141039ad8ef

switches to ipfs/boxo@6141039

changes since 5cf22196ad0b:
- refactor(unixfs): use arithmetic for exact block size calculation
- refactor(unixfs): unify size tracking and make SizeEstimationMode immutable
- feat(unixfs): optimize SizeEstimationBlock and add mode/mtime tests

also clarifies that directory sharding globals affect both `ipfs add` and MFS.

* test(cli): improve HAMT threshold tests with exact +1 byte verification

- add UnixFSDataType() helper to directly check UnixFS type via protobuf
- refactor threshold tests to use exact +1 byte calculations instead of +1 file
- verify directory type directly (ft.TDirectory vs ft.THAMTShard) instead of
  inferring from link count
- clean up helper function signatures by removing unused cidLength parameter

* test(cli): consolidate profile tests into cid_profiles_test.go

remove duplicate profile threshold tests from add_test.go since they
are fully covered by the data-driven tests in cid_profiles_test.go.

changes:
- improve test names to describe what threshold is being tested
- add inline documentation explaining each test's purpose
- add byte-precise helper IPFSAddDeterministicBytes for threshold tests
- remove ~200 lines of duplicated test code from add_test.go
- keep non-profile tests (pinning, symlinks, hidden files) in add_test.go

* chore: update to rebased boxo and go-ipfs-cmds PRs

* docs: add HAMT threshold fix details to changelog

* feat(mfs): use Import config for CID version and hash function

make MFS commands (files cp, files write, files mkdir, files chcid)
respect Import.CidVersion and Import.HashFunction config settings
when CLI options are not explicitly provided.

also add tests for:
- files write respects Import.UnixFSRawLeaves=true
- single-block file: files write produces same CID as ipfs add
- updated comments clarifying CID parity with ipfs add

* feat(files): wire Import.UnixFSChunker and UnixFSDirectoryMaxLinks to MFS

`ipfs files` commands now respect these Import.* config options:
- UnixFSChunker: configures chunk size for `files write`
- UnixFSDirectoryMaxLinks: triggers HAMT sharding in `files mkdir`
- UnixFSHAMTDirectorySizeEstimation: controls size estimation mode

previously, MFS used hardcoded defaults ignoring user config.

changes:
- config/import.go: add UnixFSSplitterFunc() returning chunk.SplitterGen
- core/node/core.go: pass chunker, maxLinks, sizeEstimationMode to
  mfs.NewRoot() via new boxo RootOption API
- core/commands/files.go: pass maxLinks and sizeEstimationMode to
  mfs.Mkdir() and ensureContainingDirectoryExists(); document that
  UnixFSFileMaxLinks doesn't apply to files write (trickle DAG limitation)
- test/cli/files_test.go: add tests for UnixFSDirectoryMaxLinks and
  UnixFSChunker, including CID parity test with `ipfs add --trickle`

related: boxo@54e044f1b265

* feat(files): wire Import.UnixFSHAMTDirectoryMaxFanout and UnixFSHAMTDirectorySizeThreshold

wire remaining HAMT config options to MFS root:
- Import.UnixFSHAMTDirectoryMaxFanout via mfs.WithMaxHAMTFanout
- Import.UnixFSHAMTDirectorySizeThreshold via mfs.WithHAMTShardingSize

add CLI tests:
- files mkdir respects Import.UnixFSHAMTDirectoryMaxFanout
- files mkdir respects Import.UnixFSHAMTDirectorySizeThreshold
- config change takes effect after daemon restart

add UnixFSHAMTFanout() helper to test harness

update boxo to ac97424d99ab90e097fc7c36f285988b596b6f05

* fix(mfs): single-block files in CIDv1 dirs now produce raw CIDs

problem: `ipfs files write` in CIDv1 directories wrapped single-block
files in dag-pb even when raw-leaves was enabled, producing different
CIDs than `ipfs add --raw-leaves` for the same content.

fix: boxo now collapses single-block ProtoNode wrappers (with no
metadata) to RawNode in DagModifier.GetNode(). files with mtime/mode
stay as dag-pb since raw blocks cannot store UnixFS metadata.

also fixes sparse file writes where writing past EOF would lose data
because expandSparse didn't update the internal node pointer.

updates boxo to v0.36.1-0.20260203003133-7884ae23aaff
updates t0250-files-api.sh test hashes to match new behavior

* chore(test): use Go 1.22+ range-over-int syntax

* chore: update boxo to c6829fe26860

- fix typo in files write help text
- update boxo with CI fixes (gofumpt, race condition in test)

* chore: update go-ipfs-cmds to 192ec9d15c1f

includes binary content types fix: gzip, zip, vnd.ipld.car, vnd.ipld.raw,
vnd.ipfs.ipns-record

* chore: update boxo to 0a22cde9225c

includes refactor of maxLinks check in addLinkChild (review feedback).

* ci: fix helia-interop and improve caching

skip '@helia/mfs - should have the same CID after creating a file' test
until helia implements IPIP-499 (tracking: ipfs/helia#941)

the test fails because kubo now collapses single-block files to raw CIDs
while helia explicitly uses reduceSingleLeafToSelf: false

changes:
- run aegir directly instead of helia-interop binary (binary ignores --grep flags)
- cache node_modules keyed by @helia/interop version from npm registry
- skip npm install on cache hit (matches ipfs-webui caching pattern)

* chore: update boxo to 1e30b954

includes latest upstream changes from boxo main

* chore: update go-ipfs-cmds to 1b2a641ed6f6

* chore: update boxo to f188f79fd412

switches to boxo@main after merging ipfs/boxo#1088

* chore: update go-ipfs-cmds to af9bcbaf5709

switches to go-ipfs-cmds@master after merging ipfs/go-ipfs-cmds#315

---------

Co-authored-by: Andrew Gillis <11790789+gammazero@users.noreply.github.com>
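
The leaf-depth check described in the commit message above can be illustrated with a toy tree. This is a minimal sketch, not the Kubo test itself: the `node` type and function names are hypothetical, and a real test would walk UnixFS blocks rather than an in-memory struct. It only shows the property being verified, that every leaf sits at the same depth.

```go
package main

import "fmt"

// node is a hypothetical in-memory stand-in for a DAG node.
// A node without children is treated as a leaf.
type node struct {
	children []*node
}

// collectLeafDepths records the depth of every leaf reachable from n.
func collectLeafDepths(n *node, depth int, seen map[int]bool) {
	if n == nil {
		return
	}
	if len(n.children) == 0 {
		seen[depth] = true
		return
	}
	for _, c := range n.children {
		collectLeafDepths(c, depth+1, seen)
	}
}

// uniformLeafDepth reports whether all leaves are at the same depth.
func uniformLeafDepth(root *node) bool {
	seen := map[int]bool{}
	collectLeafDepths(root, 0, seen)
	return len(seen) <= 1
}

func leaf() *node { return &node{} }

func main() {
	// Every leaf is two levels below the root.
	balanced := &node{children: []*node{
		{children: []*node{leaf(), leaf()}},
		{children: []*node{leaf(), leaf()}},
	}}
	// One leaf hangs directly off the root, the others are one level deeper.
	uneven := &node{children: []*node{
		leaf(),
		{children: []*node{leaf(), leaf()}},
	}}

	fmt.Println(uniformLeafDepth(balanced), uniformLeafDepth(uneven)) // prints: true false
}
```

Including a deliberately uneven tree mirrors the trickle subtest mentioned above: it confirms the check can fail, so a passing result on the balanced case is meaningful.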
@lidel
Member

lidel commented Feb 5, 2026

@parkan Not for me to say, it is up to the project. The difference (balanced vs balanced-packed) is documented in the IPIP, so Singularity could adopt it or state "we made a conscious decision to use balanced-packed as defined in IPIP-499" and call it a day.

Whether matching future Kubo 1.0 CIDs (the unixfs-v1-2025 profile) matters depends on your use case. This isn't a holy grail: many projects will keep using specialized DAG layouts for their specific needs (e.g., WebRecorder uses a custom chunker for WARC files).
